Basic Statistics and Projects in R

Data visualization with the tidyverse

Christian Althaus, Judith Bouman, Martin Wohlfender

Fundamentals

Claus Wilke, a professor of integrative biology at The University of Texas at Austin, wrote the book Fundamentals of Data Visualization as a guide to making visualizations that

  • accurately reflect the data,
  • tell a story,
  • and look professional.

Note that the entire book was written in R Markdown using RStudio!

Ugly, bad, and wrong figures

  • ugly - A figure that has aesthetic problems but otherwise is clear and informative.
  • bad - A figure that has problems related to perception; it may be unclear, confusing, overly complicated, or deceiving.
  • wrong - A figure that has problems related to mathematics; it is objectively incorrect.

Aesthetics

All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as aesthetics.
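As a minimal sketch of this idea (using the built-in mtcars data rather than any data set from this course), each argument inside aes() maps one variable to one aesthetic: position, colour, or size.

```r
library(ggplot2)

# Each argument inside aes() maps one column of mtcars to one aesthetic:
# x and y position, colour, and point size.
p_aes <- ggplot(mtcars, aes(x = disp, y = hp,
                            colour = factor(cyl), size = wt)) +
  geom_point(alpha = 0.7)  # alpha is a fixed setting here, not a mapping

p_aes
```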

Coordinate systems

Coordinate systems don’t have to be Cartesian.
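For illustration only, the same bar chart can be drawn in Cartesian and in polar coordinates by adding coord_polar(). The mtcars data and the object names are assumptions made for this sketch.

```r
library(ggplot2)

# A bar chart counting cars per number of cylinders
bars <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar(width = 1)

bars_cartesian <- bars                 # default: Cartesian coordinates
bars_polar     <- bars + coord_polar() # same data, polar coordinates

bars_polar
```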

Color scales

There are three fundamental use cases for color in data visualizations:

  1. We can use color to distinguish groups of data from each other.
  2. We can use color to represent data values.
  3. We can use color to highlight.

The types of colors we use and the way in which we use them are quite different for these three cases.

Color as a tool to distinguish

Color to represent data values

Color as a tool to highlight

ColorBrewer

Cynthia Brewer, a cartographer at Pennsylvania State University, designed the widely used ColorBrewer color schemes.

You can use the interactive web tool ColorBrewer 2.0 to choose an appropriate color scheme for your needs.

To use these color schemes in R, install the package RColorBrewer.
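A short sketch of how the package might be used, assuming it has been installed with install.packages("RColorBrewer"). The palette "Set2" and the mtcars data are arbitrary choices for illustration.

```r
library(ggplot2)
library(RColorBrewer)

# Hex codes of a three-colour qualitative palette
set2 <- brewer.pal(n = 3, name = "Set2")

# Overview of the palettes flagged as colourblind friendly
display.brewer.all(colorblindFriendly = TRUE)

# ggplot2 can use ColorBrewer palettes directly through its brewer scales
p_brewer <- ggplot(mtcars, aes(x = disp, y = hp, colour = factor(cyl))) +
  geom_point() +
  scale_colour_brewer(palette = "Set2", name = "Cylinders")
```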

Colorblind safe figures

If you do not have a color vision deficiency, it is very hard to imagine what it is like to be colorblind.

The Color Blindness Simulator can close this gap for you. Play around with it to check whether your figures are colorblind safe.

unibeCols

The University of Bern has a set of corporate design colors that are defined in the manual “Gestaltungselemente”.

Thanks to Alan, you can easily install this color scheme with the unibeCols package: https://github.com/CTU-Bern/unibeCols

Visualizing (many) distributions

Visualizing geospatial data

Visualizing uncertainty

Whenever you visualize uncertainty with error bars, you must specify what quantity and/or confidence level the error bars represent.
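One possible way to follow this rule is to state the meaning of the error bars in the caption. The sketch below uses the built-in mtcars data and assumes that a 95% confidence interval based on the t distribution is the quantity of interest.

```r
library(ggplot2)

# Mean and 95% confidence interval (t distribution) of mpg per cylinder group
groups <- split(mtcars$mpg, mtcars$cyl)
ci <- data.frame(
  cyl  = factor(names(groups)),
  mean = sapply(groups, mean),
  se   = sapply(groups, function(x) sd(x) / sqrt(length(x))),
  n    = sapply(groups, length)
)
ci$lower <- ci$mean - qt(0.975, df = ci$n - 1) * ci$se
ci$upper <- ci$mean + qt(0.975, df = ci$n - 1) * ci$se

# The caption says explicitly what the error bars represent
p_ci <- ggplot(ci, aes(x = cyl, y = mean)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  labs(x = "Number of cylinders", y = "Fuel efficiency (mpg)",
       caption = "Points: group means. Error bars: 95% confidence intervals.")
```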

The principle of proportional ink

When a shaded region is used to represent a numerical value, the area of that shaded region should be directly proportional to the corresponding value. - Bergstrom & West
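A small sketch of what this means for bar charts, using the built-in mtcars data as an assumed example: bars that start at zero respect the principle, while a zoomed y-axis cuts off part of each bar.

```r
library(ggplot2)

# Mean fuel efficiency per number of cylinders
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Bars start at zero, so the bar area is proportional to the value shown
p_ink_ok <- ggplot(mean_mpg, aes(x = factor(cyl), y = mpg)) +
  geom_col()

# Zooming the y-axis hides the base of the bars: the visible area no longer
# reflects the values, which violates the principle of proportional ink
p_ink_bad <- p_ink_ok + coord_cartesian(ylim = c(14, 28))
```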

Handling overlapping data points

Don’t go 3D

Even though the 3D visualizations are shown from four different perspectives, it is difficult to envision how exactly the points are distributed in space.

Instead, map one of the variables (in this case fuel efficiency) onto another aesthetic (size of the dots).
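A minimal sketch of this alternative, assuming the three variables are displacement, power and fuel efficiency from the built-in mtcars data:

```r
library(ggplot2)

# Two variables go on the axes; the third (fuel efficiency) is mapped to size
p_size <- ggplot(mtcars, aes(x = disp, y = hp, size = mpg)) +
  geom_point(alpha = 0.6) +
  labs(x = "displacement (cu. in.)",
       y = "power (hp)",
       size = "fuel efficiency (mpg)")

p_size
```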

Commonly used image file formats

Acronym  Name                              Type    Application
pdf      Portable Document Format          vector  general purpose
eps      Encapsulated PostScript           vector  general purpose, outdated; use pdf
svg      Scalable Vector Graphics          vector  online use
png      Portable Network Graphics         bitmap  optimized for line drawings
jpeg     Joint Photographic Experts Group  bitmap  optimized for photographic images
tiff     Tagged Image File Format          bitmap  print production, accurate color reproduction
raw      Raw Image File                    bitmap  digital photography, needs post-processing
gif      Graphics Interchange Format       bitmap  outdated for static figures, OK for animations

Base plot vs. ggplot

plot(mtcars$disp, mtcars$hp,
     xlab = "displacement (cu. in.)",
     ylab = "power (hp)",
     main = "Scatter plot in base plot")

library(ggplot2)

ggplot(mtcars, aes(x = disp, y = hp)) +
    geom_point() +
    xlab("displacement (cu. in.)") +
    ylab("power (hp)") +
    ggtitle("Scatter plot in ggplot")

Data exploration and visualization with ggplot2

Artwork by @allison_horst

Program for the rest of the afternoon

  • General idea of using ggplot2
  • Basic graphs: geom_point, geom_line and geom_col (60 min)
  • Fancify basic graphs: colors, legend, axes, theme and patchwork (60 min)
  • Other types of geom: histogram, density, violin, boxplot (50 min)

Data visualization with ggplot2

ggplot2 is based on the grammar of graphics, a conceptual approach to building graphs from layers.

Pass a data frame, map variables to aesthetics (e.g. x, y, colour), and tell ggplot2 which geometry to use (e.g. point, line).

2023 - R for the Rest of Us

Types of layers

  • Geometries: Representation of data
  • Scales: Defining axes and legends
  • Labels: Adding descriptive text, e.g., axis labels
  • Themes: General appearance of plot
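The four layer types can be sketched in a single plot. The example below uses the built-in mtcars data; the title and legend labels are made up for illustration.

```r
library(ggplot2)

p_layers <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(am))) +
  # Geometry: how the data are represented
  geom_point() +
  # Scale: how data values translate into colours and legend entries
  scale_colour_discrete(name = "Transmission",
                        labels = c("automatic", "manual")) +
  # Labels: descriptive text
  labs(title = "Fuel efficiency by weight",
       x = "Weight (1000 lbs)", y = "Fuel efficiency (mpg)") +
  # Theme: general appearance
  theme_minimal()

p_layers
```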

Example

Cheatsheet

Example data COVID-19

library(dplyr)  # provides %>%, mutate() and filter()

covid <- read.csv("data/raw/COVID19Cases_geoRegion.csv")
covid <- covid %>% mutate(datum = as.Date(datum))

head(covid)
  geoRegion      datum entries sumTotal timeframe_14d timeframe_all
1        CH 2020-02-24       1        1         FALSE          TRUE
2        CH 2020-02-25       1        2         FALSE          TRUE
3        CH 2020-02-26      10       12         FALSE          TRUE
4        CH 2020-02-27      10       22         FALSE          TRUE
5        CH 2020-02-28      10       32         FALSE          TRUE
6        CH 2020-02-29      13       45         FALSE          TRUE
  offset_last7d sumTotal_last7d offset_last14d sumTotal_last14d offset_last28d
1       4385008               0        4383801                0        4376250
2       4385008               0        4383801                0        4376250
3       4385008               0        4383801                0        4376250
4       4385008               0        4383801                0        4376250
5       4385008               0        4383801                0        4376250
6       4385008               0        4383801                0        4376250
  sumTotal_last28d sum7d sum14d mean7d mean14d entries_diff_last_age     pop
1                0    NA     NA     NA      NA                     7 8738791
2                0    NA     NA     NA      NA                     7 8738791
3                0    NA     NA     NA      NA                     7 8738791
4                0    NA     NA   8.14      NA                     7 8738791
5                0    NA     NA  12.29      NA                     7 8738791
6                0    NA     NA  16.86      NA                     7 8738791
  inz_entries inzsumTotal inzmean7d inzmean14d inzsumTotal_last7d
1        0.01        0.01        NA         NA                 NA
2        0.01        0.02        NA         NA                 NA
3        0.11        0.14        NA         NA                 NA
4        0.11        0.25      0.09         NA                 NA
5        0.11        0.37      0.14         NA                 NA
6        0.15        0.51      0.19         NA                 NA
  inzsumTotal_last14d inzsumTotal_last28d inzsum7d inzsum14d sumdelta7d
1                  NA                  NA       NA        NA         NA
2                  NA                  NA       NA        NA         NA
3                  NA                  NA       NA        NA         NA
4                  NA                  NA       NA        NA         NA
5                  NA                  NA       NA        NA         NA
6                  NA                  NA       NA        NA         NA
  inzdelta7d         type type_variant             version datum_unit
1         NA COVID19Cases           NA 2023-01-24_06-03-16        day
2         NA COVID19Cases           NA 2023-01-24_06-03-16        day
3         NA COVID19Cases           NA 2023-01-24_06-03-16        day
4         NA COVID19Cases           NA 2023-01-24_06-03-16        day
5         NA COVID19Cases           NA 2023-01-24_06-03-16        day
6         NA COVID19Cases           NA 2023-01-24_06-03-16        day
  entries_letzter_stand entries_neu_gemeldet entries_diff_last
1                     1                    0               914
2                     1                    0               914
3                    10                    0               914
4                    10                    0               914
5                    10                    0               914
6                    13                    0               914
dim(covid)
[1] 30247    36

Data from the COVID-19 BAG dashboard: https://www.covid19.admin.ch/

dataframe setup: covid_cantons_2020

# filter data frame covid: 
# only keep confirmed cases in the cantons of Zurich, Bern and Vaud 
# in the first half of the year 2020
covid_cantons_2020 <- covid %>% filter(datum <= as.Date("2020-06-30") 
                    & (geoRegion == "ZH" | geoRegion == "BE" | geoRegion == "VD"))

# write data frame covid_cantons_2020 to a csv file
write.csv(x = covid_cantons_2020, file = "data/processed/covid_cantons_2020_06.csv")

head(covid_cantons_2020)
  geoRegion      datum entries sumTotal timeframe_14d timeframe_all
1        BE 2020-02-24       0        0         FALSE          TRUE
2        BE 2020-02-25       0        0         FALSE          TRUE
3        BE 2020-02-26       0        0         FALSE          TRUE
4        BE 2020-02-27       1        1         FALSE          TRUE
5        BE 2020-02-28       0        1         FALSE          TRUE
6        BE 2020-02-29       1        2         FALSE          TRUE
  offset_last7d sumTotal_last7d offset_last14d sumTotal_last14d offset_last28d
1        507985               0         507871                0         507046
2        507985               0         507871                0         507046
3        507985               0         507871                0         507046
4        507985               0         507871                0         507046
5        507985               0         507871                0         507046
6        507985               0         507871                0         507046
  sumTotal_last28d sum7d sum14d mean7d mean14d entries_diff_last_age     pop
1                0    NA     NA     NA      NA                     7 1047473
2                0    NA     NA     NA      NA                     7 1047473
3                0    NA     NA     NA      NA                     7 1047473
4                0    NA     NA   0.29      NA                     7 1047473
5                0    NA     NA   0.86      NA                     7 1047473
6                0    NA     NA   1.29      NA                     7 1047473
  inz_entries inzsumTotal inzmean7d inzmean14d inzsumTotal_last7d
1         0.0        0.00        NA         NA                 NA
2         0.0        0.00        NA         NA                 NA
3         0.0        0.00        NA         NA                 NA
4         0.1        0.10      0.03         NA                 NA
5         0.0        0.10      0.08         NA                 NA
6         0.1        0.19      0.12         NA                 NA
  inzsumTotal_last14d inzsumTotal_last28d inzsum7d inzsum14d sumdelta7d
1                  NA                  NA       NA        NA         NA
2                  NA                  NA       NA        NA         NA
3                  NA                  NA       NA        NA         NA
4                  NA                  NA       NA        NA         NA
5                  NA                  NA       NA        NA         NA
6                  NA                  NA       NA        NA         NA
  inzdelta7d         type type_variant             version datum_unit
1         NA COVID19Cases           NA 2023-01-24_06-03-16        day
2         NA COVID19Cases           NA 2023-01-24_06-03-16        day
3         NA COVID19Cases           NA 2023-01-24_06-03-16        day
4         NA COVID19Cases           NA 2023-01-24_06-03-16        day
5         NA COVID19Cases           NA 2023-01-24_06-03-16        day
6         NA COVID19Cases           NA 2023-01-24_06-03-16        day
  entries_letzter_stand entries_neu_gemeldet entries_diff_last
1                     0                    0                75
2                     0                    0                75
3                     0                    0                75
4                     1                    0                75
5                     0                    0                75
6                     1                    0                75

Goal 1 (exercise 4)

geom_point: basic plot

library(ggplot2)

plot_covid_point_v0 <- ggplot(data = covid_cantons_2020, 
                              mapping = aes(x = datum, y = entries)) + 
  geom_point()

Note: ggplot2 does not chain layers with the %>% or |> pipes; it uses + instead.
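The two operators can still appear in the same statement: a pipe passes the data into ggplot(), and from there on layers are added with +. A small sketch with the built-in mtcars data, assuming R 4.1 or later for the native |> pipe:

```r
library(ggplot2)

# The pipe hands the filtered data frame to ggplot();
# after that, layers are combined with +, not with a pipe
p_pipe <- mtcars |>
  subset(cyl == 4) |>
  ggplot(aes(x = disp, y = hp)) +
  geom_point()

p_pipe
```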

geom_line: basic plot

plot_covid_line_v0 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion))

geom_col: basic plot

plot_covid_col_v0 <- ggplot(data = covid_cantons_2020, 
                            mapping = aes(x = datum, y = entries)) + 
  geom_col(position = "stack")

Exercise 4A: basic plot

  1. Read Ebola data and sort it by date.
  2. Determine what variables you need to include in your dataframe to make the type of plot shown below.
  3. Create a dataframe with the required variables and all data for 3 countries before 31 March 2015.

Exercise 4B: basic plot

Create basic point, line and column plots of the cumulative number of confirmed cases versus time.

ggsave: saving your plot

# Save the plot as a PNG using ggsave
ggsave("plot_covid_point_goal.png", plot = plot_covid_point_goal, width = 8, height = 6, units = "in", dpi = 300)

# Save the plot as a PDF using ggsave
ggsave("plot_covid_point_goal.pdf", plot = plot_covid_point_goal, width = 8, height = 6)

Try this for your own plot.

geom_point: colour and fill

plot_covid_point_v1 <- ggplot(data = covid_cantons_2020, 
                              mapping = aes(x = datum, y = entries)) + 
  geom_point(alpha = 0.7, colour = "black", fill = "blue", 
             shape = 21, size = 1.5, stroke = 1.5)

geom_line: colour and fill

plot_covid_line_v1 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion), 
            alpha = 0.7, colour = "blue", linetype = "solid", linewidth = 1.5)

geom_col: colour and fill

plot_covid_col_v1 <- ggplot(data = covid_cantons_2020, 
                            mapping = aes(x = datum, y = entries)) + 
  geom_col(position = "stack", alpha = 0.7, fill = "blue", 
           linetype = "solid", linewidth = 0.5, width = 0.7)

Exercise 4C: colour and fill

Change global aesthetics of the 3 plots you created in Exercise 4B.

  1. Point plot: Try different values for alpha, colour, fill, shape, size and stroke.
  2. Line plot: Try different values for alpha, colour, linetype and linewidth.
  3. Column plot: Try different values for alpha, colour, fill, linetype, linewidth, position and width.

geom_point: color per country

plot_covid_point_v2 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5)

Global vs. local aesthetics

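# Global aesthetics: set in ggplot() and inherited by every geom.
# Note that the grouping aesthetic is called group, not group_by.
ggplot(data = covid_cantons_2020, 
       mapping = aes(x = datum, y = entries, colour = geoRegion, 
                     fill = geoRegion, group = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  geom_line()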

Global vs. local aesthetics

# Position and grouping are mapped globally;
# colours are set locally as constants in each geom.
ggplot(data = covid_cantons_2020, 
       mapping = aes(x = datum, y = entries, group = geoRegion)) + 
  geom_point(alpha = 0.7, colour = "black", fill = "black", shape = 21, 
             size = 1.5, stroke = 1.5) +
  geom_line(colour = "red")
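The difference can also be sketched on the built-in mtcars data (an assumed stand-in for the COVID-19 data): a mapping given in ggplot() is inherited by all geoms, whereas a mapping given inside one geom applies to that geom only.

```r
library(ggplot2)

# Global mapping: colour is declared in ggplot(),
# so both the points and the lines are coloured by cylinder group
p_global <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_line()

# Local mapping: colour is declared in geom_point() only,
# so geom_line() draws a single line in the default colour
p_local <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_line()
```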

More examples on local vs. global aesthetics

geom_line: color per country

plot_covid_line_v2 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5)

geom_col: color per country

plot_covid_col_v2 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7, 
           linetype = "solid", linewidth = 0.5, width = 0.7)

Exercise 4D: color per country

Change aesthetic mappings of the 3 plots you created in Exercise 4C.

  1. Point plot: Set fill colour of points by country.
  2. Line plot: Set colour of lines by country.
  3. Column plot: Set fill colour of columns by country.

geom_point: labels

plot_covid_point_v3 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_line: labels

plot_covid_line_v3 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_col: labels

plot_covid_col_v3 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

Exercise 4E: labels

Change the title and the labels of the axes of the 3 plots you created in Exercise 4D.

  1. Set the title to “Confirmed Ebola cases”.
  2. Set the label of the x-axis to “Time”.
  3. Set the label of the y-axis to “Cum. # of confirmed cases”.

geom_point: change standard colors

library(unibeCols)

plot_covid_point_v4 <- ggplot(data = covid_cantons_2020, 
    mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_line: change standard colors

plot_covid_line_v4 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_col: change standard colors

plot_covid_col_v4 <- ggplot(data = covid_cantons_2020, 
                            mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

Exercise 4F

Change the colour or fill scale of the three plots you created in Exercise 4E.

  1. Point plot: Change fill scale manually.
  2. Line plot: Change colour scale manually.
  3. Column plot: Change fill scale manually.

geom_point: scales

plot_covid_point_v5 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", 
                                  "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_line: scales

plot_covid_line_v5 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

geom_col: scales

plot_covid_col_v5 <- ggplot(data = covid_cantons_2020, 
      mapping = aes(x = datum, y = entries, fill = geoRegion, group=geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 600, by = 100),
                     limits = c(0, 600)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases")

Exercise 4G: scales

Change the scale of the axes of the three plots you created in Exercise 4F.

  1. Point plot: Change the breaks of the x-axis to 29 August, 1 October, 1 December, 1 February, and 1 April.
  2. Point and line plots: Change the breaks of the y-axis to 0, 2500, 5000, 7500 and 10000.
  3. Column plot: Change the breaks of the y-axis to 0, 2500, 5000, 7500, 10000, 12500 and 15000.

Themes

Graphic from https://www.geeksforgeeks.org/themes-and-background-colors-in-ggplot2-in-r/

geom_point: themes

plot_covid_point_v6 <- ggplot(data = covid_cantons_2020, 
  mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")

geom_line: themes

plot_covid_line_v6 <- ggplot(data = covid_cantons_2020, 
                             mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")

geom_col: themes

plot_covid_col_v6 <- ggplot(data = covid_cantons_2020, 
    mapping = aes(x = datum, y = entries, fill = geoRegion, colour=geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 600, by = 100),
                     limits = c(0, 600)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom")

Exercise 4H: themes

Change the theme of the three plots you created in Exercise 4G to theme_bw().

geom_point: facet

plot_covid_point_facet <- ggplot(data = covid_cantons_2020, 
   mapping = aes(x = datum, y = entries, fill = geoRegion, colour=geoRegion)) + 
  geom_point(alpha = 0.7, shape = 21, size = 1.5, stroke = 1.5) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  theme(panel.spacing = unit(2, "lines")) +
  facet_grid(cols = vars(geoRegion))

geom_line: facet

plot_covid_line_facet <- ggplot(data = covid_cantons_2020, 
                                mapping = aes(x = datum, y = entries)) + 
  geom_line(mapping = aes(group = geoRegion, colour = geoRegion), 
            alpha = 0.7, linetype = "solid", linewidth = 1.5) +
  scale_colour_manual(name = "Canton",
                      breaks = c("BE", "VD", "ZH"),
                      values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                      labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 350, by = 50),
                     limits = c(0, 350)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  theme(panel.spacing = unit(2, "lines")) +
  facet_grid(cols = vars(geoRegion))

geom_col: facet

plot_covid_col_facet <- ggplot(data = covid_cantons_2020, 
 mapping = aes(x = datum, y = entries, fill = geoRegion, colour = geoRegion)) + 
  geom_col(position = "stack", alpha = 0.7,
           linetype = "solid", linewidth = 0.5, width = 0.7) +
  scale_fill_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_colour_manual(name = "Canton",
                    breaks = c("BE", "VD", "ZH"),
                    values = c(unibeRedS()[1], unibeMustardS()[1], unibeIceS()[1]),
                    labels = c("Bern", "Vaud", "Zurich")) +
  scale_x_date(breaks = as.Date(c("2020-02-24", "2020-04-01", "2020-05-01", 
                                  "2020-06-01","2020-07-01")),
               labels = c("24 February", "1 April", "1 May", "1 June", "1 July"),
               limits = as.Date(c("2020-02-23", "2020-07-01"))) +
  scale_y_continuous(breaks = seq(from = 0, to = 600, by = 100),
                     limits = c(0, 600)) +
  ggtitle(label = "Confirmed covid cases in 3 cantons") +
  xlab(label = "Time") +
  ylab(label = "# of confirmed cases") +
  theme_bw() + theme(legend.position="bottom") +
  theme(panel.spacing = unit(2, "lines")) +
  facet_grid(cols = vars(geoRegion))

Exercise 4I: facet

Create facet grids by country from the three plots you created in Exercise 4H.
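As a reminder of the mechanics, and not a solution to the exercise, here is a minimal faceting sketch on the built-in mtcars data. facet_grid() takes the faceting variables wrapped in vars().

```r
library(ggplot2)

p_base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One panel per number of cylinders, arranged in columns
p_facet_cols <- p_base + facet_grid(cols = vars(cyl))

# Rows and columns at once: transmission type by number of cylinders
p_facet_both <- p_base + facet_grid(rows = vars(am), cols = vars(cyl))
```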

Patchwork

Artwork by @allison_horst

geom_point: grid

library(cowplot)

plot_covid_point_grid <- plot_grid(plotlist = list(plot_covid_point_v1, plot_covid_point_v2, plot_covid_point_v3, 
                                                   plot_covid_point_v4, plot_covid_point_v5, plot_covid_point_v6),
                                   labels = c("V1", "V2", "V3", "V4", "V5", "V6"), label_size = 12, nrow = 2)

Install cowplot:

install.packages("cowplot")

geom_line: grid

plot_covid_line_grid <- plot_grid(plotlist = list(plot_covid_line_v1, plot_covid_line_v2, plot_covid_line_v3, 
                                                  plot_covid_line_v4, plot_covid_line_v5, plot_covid_line_v6),
                                  labels = c("V1", "V2", "V3", "V4", "V5", "V6"), label_size = 12, nrow = 2)

geom_col: grid

plot_covid_col_grid <- plot_grid(plotlist = list(plot_covid_col_v1, plot_covid_col_v2, plot_covid_col_v3, 
                                                 plot_covid_col_v4, plot_covid_col_v5, plot_covid_col_v6),
                                 labels = c("V1", "V2", "V3", "V4", "V5", "V6"), label_size = 12, nrow = 2)

Exercise 4J: grid

Arrange six of the plots you created in the previous exercises into a grid.
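The slides build their grids with cowplot. As an alternative sketch, the patchwork package (assumed to be installed with install.packages("patchwork")) combines plots with the operators | for side by side and / for stacking. The three plots below use the built-in mtcars data and are placeholders for your own plots.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = disp, y = hp)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
p3 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# p1 and p2 side by side in the top row, p3 below them;
# plot_annotation() adds the panel tags A, B, C
combined <- (p1 | p2) / p3 + plot_annotation(tag_levels = "A")

combined
```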

Types of geom

Example data: insurance

insurance <- read.csv("data/raw/insurance_with_date.csv")
insurance <- insurance %>% mutate(children = as.factor(children))

head(insurance)
  X age    sex    bmi children smoker    region   charges       date
1 1  59   male 31.790        2     no southeast 13086.341 2001-01-15
2 2  24 female 22.600        0     no southwest  2574.268 2001-01-17
3 3  28 female 25.935        1     no northwest  4411.400 2001-01-22
4 4  22   male 25.175        0     no northwest  2321.417 2001-01-29
5 5  60 female 36.005        0     no northeast 13434.551 2001-02-06
6 6  38 female 28.000        3     no southwest  7262.940 2001-02-17
dim(insurance)
[1] 1338    9

Data adapted from “Machine Learning with R” by Brett Lantz.

Density plot / histogram

Exercise 5A: Can you reproduce these graphs using the insurance.csv dataset?

Quantiles

Exercise 5B: Can you reproduce this graph using the insurance.csv dataset?

Violin plot / boxplot

Exercise 5C: Can you reproduce these graphs using the insurance.csv dataset?
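Before starting on the exercises, it may help to see the four distribution geoms side by side. The sketch below deliberately uses the built-in mtcars data instead of the insurance data, so it shows the syntax and leaves the exercises open.

```r
library(ggplot2)

# Distribution of a single continuous variable
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)
p_dens <- ggplot(mtcars, aes(x = mpg)) +
  geom_density()

# Distribution of a continuous variable per group
p_violin <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_violin()
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()
```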

Cheatsheet

Practice makes perfect

Community driven projects for practicing

Images by @tanyashapiro, @gkaramanis, @cscherer